


Creators/Authors contains: "Truong, Quang Sang"


  1. Video Paragraph Captioning aims to generate a multi-sentence description of an untrimmed video with multiple temporal event locations in coherent storytelling. Following the human perception process, where a scene is effectively understood by decomposing it into visual components (e.g., human, animal) and non-visual components (e.g., action, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new VL contrastive loss function to ensure that the learnt embedding features are consistent with the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms previous state-of-the-art methods in terms of accuracy and diversity. The source code is publicly available at: https://github.com/UARK-AICV/VLTinT.
    Free, publicly-accessible full text available June 27, 2024
  2. The automatic classification of electrocardiogram (ECG) signals plays an important role in the diagnosis and prediction of cardiovascular diseases. Deep neural networks (DNNs), particularly Convolutional Neural Networks (CNNs), have excelled in a variety of intelligent tasks, including biomedical and health informatics. Most existing approaches either partition the ECG time series into a set of segments and apply 1D-CNNs, or divide the ECG signal into a set of spectrogram images and apply 2D-CNNs. These studies, however, suffer from the limitation that temporal dependencies between 1D segments or 2D spectrograms are not considered during network construction. Furthermore, metadata such as gender and age has not been well studied in this body of work. To address these limitations, we propose a multi-module Recurrent Convolutional Neural Network (RCNN) architecture that combines CNNs, to learn spatial representations, with Recurrent Neural Networks (RNNs), to model temporal relationships. Our multi-module RCNN architecture is designed as an end-to-end deep framework with four modules: (i) a time-series module using 1D RCNNs, which extracts spatio-temporal information from the ECG time series; (ii) a spectrogram module using 2D RCNNs, which learns a visual-temporal representation of the ECG spectrogram; (iii) a metadata module, which vectorizes age and gender information; (iv) a fusion module, which semantically fuses the information from the three modules above using a transformer encoder. Ten-fold cross-validation was used to evaluate the approach on the MIT-BIH arrhythmia database (MIT-BIH) under different network configurations. The experimental results show that our proposed multi-module RCNNs with a transformer encoder achieve state-of-the-art performance, with a 99.14% F1 score and 98.29% accuracy.
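
For readers unfamiliar with the ten-fold cross-validation protocol mentioned in the second abstract, the following is a minimal sketch of how sample indices can be split into ten train/test partitions. It is an illustration only: the abstract does not say how the folds were formed (for example per beat or per record, shuffled or stratified), so the contiguous, unshuffled split and the helper names `k_fold_indices` and `train_test_splits` are assumptions, not the authors' procedure.

```python
def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous folds of near-equal size.

    Hypothetical helper for illustration; no shuffling or stratification.
    """
    if not 1 < k <= n_samples:
        raise ValueError("k must satisfy 1 < k <= n_samples")
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds get one additional sample each.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_test_splits(n_samples, k=10):
    """Yield (train_indices, test_indices) once per fold."""
    folds = k_fold_indices(n_samples, k)
    for i, test_idx in enumerate(folds):
        # Training set is every fold except the held-out one.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


if __name__ == "__main__":
    for fold_no, (train, test) in enumerate(train_test_splits(25, k=10), 1):
        print(fold_no, len(train), len(test))
```

Each sample appears in exactly one test fold, so a metric such as F1 or accuracy can be averaged over the ten held-out folds. In practice the indices would usually be shuffled, and often stratified by class, before splitting.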